\documentclass[a4paper, 12pt]{report}
\usepackage{amsmath, amsthm, amsfonts} % depended by \DeclareMathOperator, \newtheorem, \proof, \mathbb
\usepackage{array}
\usepackage{ifthen}
\usepackage{mathrsfs} % for mathscr
\usepackage[utf8]{inputenc} % utf8 encoding
\usepackage{listings} % for lst in color box
\usepackage[most]{tcolorbox} % for messageshell colorbox
\usepackage{algorithm} % depended by algpseudocode
\usepackage[noend]{algpseudocode} % pseudocode

\usepackage{hyperref} % add clickable link to tableofcontents
\usepackage{enumitem} % use enumerate
\usepackage{float} % for [H] label in figures
\usepackage{url} % for citing wiki
\usepackage{intcalc} % modulo macro?

\usepackage{fancyhdr} % add header and footer
\pagestyle{fancy}
\fancyhf{}
\rhead{\leftmark}
\cfoot{\thepage}

\graphicspath{{img/}}

\usepackage{geometry} % set margin
\geometry{left=3cm, right=2.5cm, top=2.5cm, bottom=2.5cm}
\lstset{
  basicstyle=\ttfamily,
  columns=fullflexible,
  breaklines=true,
}
\newtheorem{theorem}{Theorem}[chapter]
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}{Definition}[chapter]

\newcounter{i}

\title{Big Data Analysis Report(incGM+)}
\date{}
\author{Shitong CHAI}

\begin{document}

\maketitle
\tableofcontents

\chapter {Theory}
All the algorithms(incGM, incGM+) in this paper\cite{incGM} are based on the metric Minimum Image Based support(MNI)\cite{mni} and the concept of fringe subgraphs in subgraph space.

MNI is defined as follows:
\begin{definition}
    (MNI metric\cite{grami}) Let $f_1,\cdots,f_m$ be the set of isomorphisms of a subgraph $S(V_S,E_S)$ in a graph $G$. Also let $F(v)=\{f_1(v),\cdots,f_m(v)\}$ be the set that contains the distinct nodes in $G$ whose functions $f_1,\cdots,f_m$ map a node $v\in V_S$. The minimum image based support(MNI) of $S$ in $G$, denoted by $MNI_G(S)$, is defined as $MNI_G(S)=\min\{t:\ t=|F(v)|\ \forall v\in V_S\}$
\end{definition}

The main idea of incGM+ is rather simple based on the important property of MNI metric that keeps the extended graphs of infrequent subgraph infrequent, which is formalized by the following theorem.
\begin{theorem}
    (Property of MNI\cite{grami}) For a graph $G$, we have a tree $T$  whose nodes are subgraphs of $G$, say $n_T\in G$. And the tree is constructed by extending root with $n\in V(G), n\notin V(root)$ recursively. If we have subgraphs $S_1$ and $S_2$, subgraph $S_1$ is a child of subgraph $S_2$ and $S_2$ is infrequent in MNI metric, then $S_1$ is also infrequent in MNI metric.
\end{theorem}
Based on the theories in the above, we can work in the space of subgraphs of graph $G$, denoted by $\mathscr{G}=\{G[S]:\ S\subseteq V(G)\}$. In the subgraph space $\mathscr G$, there are some subgraphs that are frequent, and there are some subgraphs that are infrequent, which is a bipartition of subgraph space $\mathscr{G}$. 

Based on the \emph{Property of MNI}, all the extended graphs of infrequent subgraphs are infrequent, so there must be at least one subgraph that is the minimal one in size in the subgraph space $\mathscr{G}$, which is called Minimal InFrequent Subgraphs(MIFS). Symmetrically, we have
Maximal Frequent Subgraphs(MFS) which is the subgraphs which is maximal in size in the subgraph space $\mathscr{G}$.

The author of incGM+ also implemented a inverted index data structure to compute the MNI table which is aggregated in the data structure called Fast Embeddings Lookup Store(FELS)\cite{incGM}. The MNI table is computed efficiently be recursively searching by the inverted index. However, this inverted index structure can be useful to reorder search process to optimize the algorithm only when the algorithm used for evaluating the frequentness of a subgraph is
Grami-like\cite{grami}, that is, by solving a Constraint Satisfaction Problem of MNI.

The Grami algorithm\cite{grami} treats the problem of deciding if a subgraph $S$ of graph $G$ is frequent as an equivalent CSP where the variable $x_v\in \mathcal X$ is correspondent to node $v$ in $S$ and each domain $d\in\mathcal D$ for $x_v$ is a subset of $V(G)$. For every variable $x_v$ corresponding to a node in $S$, the variable can only take the value of a distinct node in $G$, that is, $|\{x_v:\ \forall x_v\in \mathcal X\}|=|V(S)|$. If any constraint cannot be satisfied, then the subgraph is infrequent.

The Grami can be used as the EVALUATE function in the official algorithm given in the incGM+ paper, the FELS data structure and inverted index is designed for it and can help pruning search space of Grami significantly and increase the speed of the incGM+ algorithm. However, I assume the EVALUATE function to be any function to test if a subgraph is frequent. So I implemented it by myself with the VF2 algorithm\cite{VF2}. The implementation of VF2 used in my code can be found
on the official website of Networkx.

\chapter {Implementation}
Based on the ideas of MNI, MFS and MIFS and VF2 algorithm in Networkx, I implemented the incGM+ algorithm in python.

This is my implementation of \emph{EVALUATION} function used to test whether $G[S_{nodes}]$ is frequent where $G$ is the global graph to perform subgraph mining on, $S_{nodes}$ is the node set of induced subgraph $S$, $\tau$ is the support threshold.
\begin{lstlisting}[language=python, frame=single]
def EVALUATE(G, tau, S_nodes):
    if not nx.is_connected(G.subgraph(S_nodes)):
        return False
    components = nx.connected_components(G)
    for cnodes in components:
        gm = isomorphism.GraphMatcher(G.subgraph(cnodes),G.subgraph(S_nodes))
        for i in gm.subgraph_isomorphisms_iter():
            fels_dict.add(i)
            if fels_dict.is_frequent(S_nodes, tau, G):
                return True
    return False
\end{lstlisting}

This is my implementation of \emph{SEARCHLIMITED} which is used to locally find embeddings for $S_{nodes}$, based on a small search region extended by the newly added edge (or graph).
\begin{lstlisting}[language=python, frame=single]
def SEARCHLIMITED(S_nodes,search_region,G):
    embeddings = False
    if not nx.is_connected(G.subgraph(S_nodes)):
        return False
    while len(search_region) < len(S_nodes):
        old_region = search_region
        search_region = neighbor(search_region,G)
        if old_region == search_region:
            break
    gm = isomorphism.GraphMatcher(G.subgraph(search_region), G.subgraph(S_nodes))
    for i in gm.subgraph_isomorphisms_iter():
        embeddings = True
        fels_dict.add(i)
        if fels_dict.is_frequent(S_nodes, tau,G):
            break
    return embeddings
\end{lstlisting}

This is my implementation of \emph{UPDATEFRINGE} which is used to update fringe by adding to MFS or MIFS, or removing from MFS or MIFS.
\begin{lstlisting}[language=python, frame=single]
def UPDATEFRINGE(fringe, S_nodes, isFreq, tau, G):
    deleted = False
    if isFreq:
        added = fringe.addMFS(S_nodes)
        deleted = fringe.removeMIFS(S_nodes)

        for i in range(len(fringe.MFS)):
            MFSi = fringe.MFS[i]
            if len(MFSi) == len(S_nodes) and MFSi != S_nodes:
                u = MFSi | S_nodes
                if u not in fringe.MIFS:
                    if not EVALUATE(G,tau,u):
                        joined = fringe.addMIFS(u)
    return deleted
\end{lstlisting}

And finally this is the implementation of the main algorithm incGM+.
\begin{lstlisting}[language=python, frame=single]
def incGM_plus(G, fringe, tau, newgraph):
    G.add_edges_from(newgraph.edges)
    newnodes = frozenset(newgraph.nodes)
    fringe.addMIFS(newnodes)
    i = 0
    while 0 <= i <len(fringe.MIFS):
        S_nodes = fringe.MIFS[i]
        embeds = SEARCHLIMITED(S_nodes, newnodes,G)
        if not embeds:
            isFreq = EVALUATE(G,tau,S_nodes)
        else:
            isFreq = fels_dict.is_frequent(S_nodes, tau, G)
        delete = UPDATEFRINGE(fringe, S_nodes, isFreq, tau, G)
        i = i + 1 - int(delete)
    return fringe.MFS
\end{lstlisting}

The data structures are based on the paper, \emph{FRINGE} stores MFS and MIFS for $G$, \emph{FELS\_dict} stores the mapping from a subset of $V(G)$ to the corresponding \emph{FELS} data structure and the \emph{FELS} contains a MNI table and an inverted index.
\begin{lstlisting}[language=python, frame=single]
import networkx as nx
from networkx.algorithms import isomorphism
def neighbor(a,G):
    # a is a node set
    # neighbor of a in G
    return frozenset(G.edge_subgraph(G.edges(a)))

class FRINGE:
    def __init__(self):
        self.MFS = []
        self.MIFS = []
    def addMFS(self, mfs):
        if mfs not in self.MFS:
            self.MFS.append(mfs)
            return True
        return False
    def addMIFS(self, mifs):
        if mifs not in self.MIFS:
            self.MIFS.append(mifs)
            return True
        return False
    def removeMIFS(self, mifs):
        if mifs in self.MIFS:
            self.MIFS.remove(mifs)
            return True
        return False

class MNI_table:
    def __init__(self,S_nodes):
        self.nodes = S_nodes
        self.table = dict()
        for i in self.nodes:
            self.table[i]=dict()
        self.supp = dict()
    def add(self, dic):
        for i in self.nodes:
            if dic[i] in self.table[i].keys():
                self.table[i][dic[i]] += 1
            else:
                self.table[i][dic[i]] = 1
            self.supp[i] = len(self.table[i].values())
    def support(self):
        v = self.supp.values()
        return min(v) if v else 0
    def frequent(self, tau):
        return self.support()>=tau

class FELS:
    def __init__(self,S_nodes):
        self.S_nodes = S_nodes
        self.mni = MNI_table(self.S_nodes)
        self.embeddings = []
        self.inverted_index = dict() # dict of set of node sets(subgraph)
    def add(self, embedding):
        self.mni.add(embedding)
    def add_inverted(self, embedding):
        emb_nodes = frozenset(embedding.keys())
        for i in embedding.keys():
            if i not in self.inverted_index.keys():
                self.inverted_index[i] = set()
            self.inverted_index[i].add(emb_nodes)


class FELS_dict:
    def __init__(self):
        self.elements = dict()

    def add(self,G2S_embedding):
        # index only by node is enough
        S2G_embedding = {v: k for k, v in G2S_embedding.items()}
        S_nodes = frozenset(S2G_embedding.keys())
        subG = frozenset(G2S_embedding.keys())
        if S_nodes not in fels_dict.keys():
            self.elements[S_nodes] = FELS(S_nodes)
        if subG not in fels_dict.keys():
            self.elements[subG] = FELS(subG)
        self.elem(S_nodes).add(S2G_embedding)
        self.elem(S_nodes).add_inverted(G2S_embedding)
        self.elem(subG).add(G2S_embedding)
        self.elem(subG).add_inverted(S2G_embedding)
        return

    def keys(self):
        return self.elements.keys()

    def elem(self, S_nodes):
        return self.elements[S_nodes]

    def is_frequent(self, S_nodes, tau,G):
        if not nx.is_connected(G.subgraph(S_nodes)):
            return False
        return self.elem(S_nodes).mni.frequent(tau)

    def iso_graphs(self, S_nodes):
        gs = set()
        inv = self.elem(S_nodes).inverted_index
        for i in inv.keys():
            for j in inv[i]:
                gs.add(j)
        return gs
\end{lstlisting}

\chapter{Result}
To verify the correctness of my implementation, I randomly generated this graph as the base graph $G$.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{base}
\caption{Base graph $G$}
\end{figure}

The generated frequent subgraph patterns reduced by isomorphism from the fringe set MFS is as follows, 37 patterns in total with minimum support $\tau=7$.
\begin{figure}[H]
\centering
    \begin{tabular}{c p{3cm}p{3cm}clc} 
        \setcounter{i}{0}
        \whiledo{\value{i} < 37} {
            \includegraphics[width=0.13\textwidth]{\thei} 
            \ifthenelse{\intcalcMod{\value{i}}{4} = 3}{\\}{&}
            \stepcounter{i}
        }

    \end{tabular}
    \caption{Frequent patterns generated (Continued)($\tau=7$)}
\end{figure}

And the time cost of adding an edge with respect to the size of the base graph (by edge) is shown below.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{time_cost}
\caption{Time cost with respect to graph size}
\end{figure}

From the figure we can observe that the VF2 algorithm may not be suitable for the \emph{EVALUATE} function for incGM+. The reordering optimization cannot be performed here, which may be the reason why the time complexity is so high.

For distributed subgraph mining,  the file name of base graphs $G_i$ and its content will be used as key value pairs to mapper, the mapper can use this algorithm to output the MFS with its local supports as mapped key value pairs. And the reducer will reduce by merging support values, compute the global support and output the frequent subgraphs. This can be done easily with MapReduce programming model in Hadoop. Just the same method as normal distributed subgraph mining
taught in the class. 
\bibliography{report_bib}{}
\bibliographystyle{plain}
\end{document}
